Final Tutorial CMSC320 - Apoorv Bansal, Shubhankar Sachdev, Andrew Huang
The issue of gun violence is one of the most prominent in our nation today. Debates over gun regulation and gun laws dominate our political atmosphere, and the topic is one of many that severely divide the citizens of the USA. In recent years, the media has focused more and more on incidents of gun violence such as mass shootings in public places like schools and houses of worship. The goal of this tutorial is to explore the issue of gun violence in America and use the data science pipeline to help inform potential policy decisions that could reduce this problem in the future.
!pip install folium
import re
import numpy as np
import matplotlib.pyplot as plt
import folium
import requests
from folium.plugins import MarkerCluster
from sklearn import datasets
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
import pandas as pd
import math
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from scipy import stats
import random
The first two stages of the data science pipeline are the collection and tidying of data. Normally these are considered two separate steps. However, our collection was relatively simple because we downloaded our data directly from the source, so we combined collection and tidying into one step.
Our collection method was to download the dataset from Kaggle. Kaggle is a community of data scientists where users can find and share various datasets. Once we downloaded the dataset, we loaded it into a Pandas DataFrame. A DataFrame is a table-like data structure organized by rows and columns. DataFrames are extremely useful because the data is significantly easier to read than in most other data structures. Additionally, Pandas provides many functions that can be used to perform complex operations and manipulations on a DataFrame. More information about DataFrames can be found in the Pandas DataFrame documentation.
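To make the idea of a DataFrame concrete, here is a minimal sketch on a tiny hand-made table. The three rows are invented for illustration and are not taken from the dataset; only the column names match ones used later in this tutorial.

```python
import pandas as pd

# Toy data, invented for illustration only.
incidents = pd.DataFrame({
    'state': ['Maryland', 'Texas', 'Maryland'],
    'n_killed': [1, 0, 2],
    'n_injured': [0, 3, 1],
})

# Select rows with a boolean condition on a column.
maryland = incidents[incidents['state'] == 'Maryland']

# Aggregate a column per group.
killed_by_state = incidents.groupby('state')['n_killed'].sum()

print(maryland)
print(killed_by_state)
```

Row filtering and grouped aggregation like this are the kinds of operations that would be much more tedious with plain lists or dictionaries.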
The next phase was the tidying of data. We began by dropping columns that we knew would not be relevant to our analysis, such as those that stored various URLs.
Next, we wanted to make some of the columns more readable. The first thing we did was clean all of the lists. In the original dataset, a cell with a list was stored as a digit indicating an index, followed by ::, followed by the data at that index. We wrote a simple function that cleans a string in this format and converts it to an actual list. We ran this on every column with a list in the untidy format, which made the data significantly more readable. Additionally, we simplified the "gun_type" column to reflect whether the gun was a pistol/shotgun or an automatic weapon, rather than recording a specific gun model. This makes it easier for those who are not knowledgeable about guns to determine the type of weapon used.
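As a small illustration of the list format described above, the following sketch parses two sample cells. The function name `parse_indexed_list` and the sample strings are hypothetical and chosen only to show the idea; the function actually used in this tutorial is `clean_str`.

```python
import re

def parse_indexed_list(cell):
    """Split an 'index::value||index::value' cell into a list of values."""
    values = []
    for entry in cell.split('||'):
        # Drop the leading 'index::' prefix, keeping whatever follows it.
        values.append(re.split(r'\d+::', entry, maxsplit=1)[-1])
    return values

# Sample cells, made up for illustration.
genders = parse_indexed_list('0::Male||1::Female')
ages = parse_indexed_list('0::20||1::5')

print(genders)  # ['Male', 'Female']
print(ages)     # ['20', '5']
```

Splitting on the prefix, rather than slicing a fixed number of characters, keeps values of any length intact.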
data = pd.read_csv('gun_violence.csv')
# Let's begin to tidy the data. Going through the columns, there are some we can drop.
data = data.drop(columns=['gun_stolen', 'participant_name', 'incident_url', 'source_url', 'incident_url_fields_missing', 'location_description', 'state_house_district', 'state_senate_district', 'sources'])
# After dropping some columns, let's make the list-valued columns (participant_age, participant_gender, etc.) more usable.
# Each of these cells holds entries of the form (int)::value, separated by || when there are several,
# so we strip the index prefix and keep only the values.
# Different gun types: Handgun, AR-15, AK-47, Shotgun, Auto rifle

# Returns a cleaned list from the messy input of an index followed by ::
def clean_str(strIn):
    strList = []
    for item in strIn.split('||'):
        # Keep only the text that follows the 'index::' prefix.
        strList.append(re.split(r'\d+::', item)[-1])
    return strList

# Go through each row and overwrite the messy cells with their cleaned versions.
for idx, r in data.iterrows():
    raw_age = str(r['participant_age'])
    raw_gender = str(r['participant_gender'])
    raw_ageGroup = str(r['participant_age_group'])
    type_shooting = str(r['incident_characteristics'])
    raw_status = str(r['participant_status'])
    raw_participant_type = str(r['participant_type'])
    gun_used = str(r['gun_type'])

    # Collapse specific gun models into two broad categories.
    if 'AK-47' in gun_used or 'AR-15' in gun_used or 'Auto' in gun_used:
        data.at[idx, 'gun_type'] = 'Automatic Gun Used'
    else:
        data.at[idx, 'gun_type'] = 'Pistol/Shotgun'

    data.at[idx, 'participant_gender'] = clean_str(raw_gender)
    data.at[idx, 'participant_age_group'] = clean_str(raw_ageGroup)
    data.at[idx, 'participant_status'] = clean_str(raw_status)
    data.at[idx, 'participant_type'] = clean_str(raw_participant_type)

    if 'Mass Shooting' in type_shooting:
        data.at[idx, 'incident_characteristics'] = 'Mass Shooting(4+ Deaths/Injuries)'
    else:
        data.at[idx, 'incident_characteristics'] = 'Isolated Shooting(0-3 Deaths/Injuries'

    # Ages use the same (int)::Age format, so clean_str handles a single age or several.
    # Taking only the last two characters would break on one-digit ages.
    data.at[idx, 'participant_age'] = clean_str(raw_age)
# [37.8, -96] at zoom level 4: approximate center coordinates for the folium map
data['year'] = data.date.str.extract(r'([0-9][0-9][0-9][0-9])', expand=True)
data["year"] = pd.to_numeric(data["year"])
data['harmed'] = data['n_killed'] + data['n_injured']
data.head()
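The two derived columns above can be checked on toy data. This is a minimal sketch, assuming the `date` column holds strings that contain a four-digit year (the 'YYYY-MM-DD' layout below is an assumption); the rows are invented for illustration.

```python
import pandas as pd

# Toy rows, invented for illustration; the date layout is an assumption.
toy = pd.DataFrame({
    'date': ['2013-01-01', '2018-03-31'],
    'n_killed': [0, 2],
    'n_injured': [4, 1],
})

# Pull out the first run of four digits as the year, then make it numeric.
toy['year'] = pd.to_numeric(toy['date'].str.extract(r'(\d{4})', expand=False))

# Total number of people killed or injured in each incident.
toy['harmed'] = toy['n_killed'] + toy['n_injured']

print(toy[['year', 'harmed']])
```

Here `expand=False` makes `str.extract` return a single Series, which can be assigned directly to one column.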
The next stage in the data science life cycle is data visualization, where we take our tidied data and, as the name implies, create visual elements such as graphs and maps to see trends and patterns in our data more easily and clearly.
Our first visualization is an interactive map of shootings across the country. We used the folium library, which specializes in creating maps. It makes building maps for visualization incredibly easy and has many features, such as clustering and easy zoom. More information about folium can be found in the Folium documentation.
Our interactive map will help us see how the shooting incidents are spread out across the United States. The clusters will show us where incidents are concentrated and will allow us to zoom into regions for additional focus.
# Create an interactive map of shootings in each state. We plot a sample, since 270,000 points is too many.
# First make a dictionary keyed by state, storing each incident's coordinates, city, deaths, and injuries.
state_shootings = {}
for i, r in data.iterrows():
    # Some rows are missing lat/long values, so we skip them: both coordinates must be present.
    if not math.isnan(r['latitude']) and not math.isnan(r['longitude']):
        # Create a tuple with the following format: (lat, long, city, killed, injured)
        point = (r['latitude'], r['longitude'], r['city_or_county'], r['n_killed'], r['n_injured'])
        if r['state'] not in state_shootings:
            state_shootings[r['state']] = [point]
        else:
            state_shootings[r['state']].append(point)
# Next we will take a sample of up to 500 of these tuples from each state and put them in a new dictionary.
state_shootings_sample = {}
for i in state_shootings:
    sample_size = min(500, len(state_shootings[i]))
    state_shootings_sample[i] = random.sample(state_shootings[i], sample_size)
map_osm = folium.Map(location=[37.8, -102], zoom_start=4)
# Allows our map to cluster points for a clean visual of the map.
mc = MarkerCluster().add_to(map_osm)
for i in state_shootings_sample:
    for j in state_shootings_sample[i]:
        lat = j[0]
        long = j[1]
        # String for popup message when clicking a point.
        popupStr = 'Town:' + str(j[2]) + '|| Death:' + str(j[3]) + '|| Injured:' + str(j[4])
        mc.add_child(folium.CircleMarker(location=[lat, long], radius=10, popup=popupStr, color='#DC143C', fill_color='#DC143C'))
map_osm.add_child(mc)
map_osm
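The capped per-state sampling used for the map can be sketched in isolation on toy data. The helper name `sample_per_group`, its `seed` parameter, and the toy values are hypothetical and only illustrate the idea; the map code itself uses a fixed cap of 500 and no seed.

```python
import random

def sample_per_group(groups, cap, seed=None):
    """Return at most `cap` randomly chosen items from each group."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    sampled = {}
    for key, items in groups.items():
        # Never ask for more items than the group actually has.
        k = min(cap, len(items))
        sampled[key] = rng.sample(items, k)
    return sampled

# Toy groups, invented for illustration.
toy_groups = {'Maryland': list(range(10)), 'Wyoming': [1, 2, 3]}
result = sample_per_group(toy_groups, cap=5, seed=0)

print({state: len(points) for state, points in result.items()})
# {'Maryland': 5, 'Wyoming': 3}
```

Capping each group separately keeps states with few incidents fully represented, while stopping states with many incidents from dominating the map.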